Background

This is the second assignment for the BCB420H1 course @ U of T. In this assignment, we will be parsing through our cleaned and normalized dataset generated in A1, and performing tests to identify differentially expressed genes and perform preliminary over-representation analyses.

Loading in Libraries

First, let’s load in all of our libraries.

Assignment 1 Recap

We are working with the GSE205517 dataset. This dataset and its accompanying study investigates the differentiation of hPSCs into left ventricular cardiomyocytes, and compares them to patient derived samples. (Dark et al. 2023) The study compares the transcriptomes of both conditions in order to determine similarity for potential therapeutic purposes. This study is a SuperSeries containing the SubSeries: - GSE203375, containing the hPSCs - GSE204885, containing the patient samples We will download both. In the previous assignment, we performed normalization on the RNA-Seq data, along with remapping HGNC symbols to avoid collisions. In this assignment, we will be perform differential gene expression (DGE) analysis, along with over-representation analysis (ORA). Before we begin these new steps, we will run the data acquisition, data overview, and normalization steps, and overview stats from assignment 1.

Data Acquisition

First, let’s perform our data acquisition step using GEOQuery. (Davis and Meltzer 2007) We’ll be re-using code blocks from A1.

# GEO Accession numbers
geo_acc_1 <- "GSE203375"
geo_acc_2 <- "GSE204885"

# The filenames for saving/loading data
filename_1 <- paste0(geo_acc_1, ".RData")
filename_2 <- paste0(geo_acc_2, ".RData")

# Reading in files from the GEO or locally
if (!file.exists(filename_1)) {
  gset_1 <- getGEO(geo_acc_1, GSEMatrix=TRUE, getGPL=FALSE)
  saveRDS(gset_1, filename_1)
} else {
  gset_1 <- readRDS(filename_1)
}
gset_1 <- gset_1[[1]]

if (!file.exists(filename_2)) {
  gset_2 <- getGEO(geo_acc_2, GSEMatrix=TRUE, getGPL=FALSE)
  saveRDS(gset_2, filename_2)
} else {
  gset_2 <- readRDS(filename_2)
}
gset_2 <- gset_2[[1]]

We now have our data loaded in. We now need to retrieve the counts tables from the GEO.

# load counts table from GEO
urld <- "https://www.ncbi.nlm.nih.gov/geo/download/?format=file&type=rnaseq_counts"
path_1 <- paste(urld, paste0("acc=", geo_acc_1), paste0("file=", geo_acc_1, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");
path_2 <- paste(urld, paste0("acc=", geo_acc_2), paste0("file=", geo_acc_2, "_raw_counts_GRCh38.p13_NCBI.tsv.gz"), sep="&");

# This checks if we already have the previous matrices saved in or not. If we have a local copy, we don't re-download. This is a time-saving and efficiency measure.
if (!file.exists(paste0(geo_acc_1,"_raw_counts.RData"))) {
  counts_data_1 <- as.matrix(data.table::fread(path_1, header=T, colClasses="integer"), rownames=1)
  name_mapping <- setNames(gset_1@phenoData@data[["title"]], gset_1@phenoData@data[["geo_accession"]])
  colnames(counts_data_1) <- name_mapping[colnames(counts_data_1)]
  saveRDS(counts_data_1, paste0(geo_acc_1, "_raw_counts.RData"))
} else {
  counts_data_1 <- readRDS(paste0(geo_acc_1, "_raw_counts.RData"))
}

# We do the same for our second dataset

if (!file.exists(paste0(geo_acc_2,"_raw_counts.RData"))) {
  counts_data_2 <- as.matrix(data.table::fread(path_2, header=T, colClasses="integer"), rownames=1)
  name_mapping <- setNames(gset_2@phenoData@data[["title"]], gset_2@phenoData@data[["geo_accession"]])
  colnames(counts_data_2) <- name_mapping[colnames(counts_data_2)]
  saveRDS(counts_data_2, paste0(geo_acc_2, "_raw_counts.RData"))
} else {
  counts_data_2 <- readRDS(paste0(geo_acc_2, "_raw_counts.RData"))
}
counts_data <- cbind(counts_data_1, counts_data_2)

Briefly Explaining our Data

This excerpt is directly from A1: Some basic information about the dataset can be obtained through the ExpressionSet we’ve read in. The study investigated differentiation of two different cell lines, H9 of karyotype 46, XX, and MLC2V of karyotype 46, XY, corresponding to cells originating from female and male sources, respectively. It then compared to to patient samples harvested from left and right atrial and ventricular tissue, along with cardiomyocytes individually harvested from each section of the heart from three patients each.

Below are the experimental details provided in the gset objects for both series:

gset_1@experimentData@title # study title
## [1] "Bulk RNA-seq of a time series of hPSCs as they differentiate towards left ventricle cardiomyocytes"
gset_1@experimentData@abstract # study abstract
## [1] "We report that both hESCs and hiPSCs can be differentiated towrads left ventricle cardiomyocytes using an optimised protocol and that these cells transit via first heart field progenitors."
gset_1@experimentData@other$overall_design # summary of experiment
## [1] "Examination of the transcriptome of two hPSC lines as they differentiate towards left ventricle cardiomyocytes; critical time points were collected"
unique(gset_1@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_1@phenoData@data[["experiment number:ch1"]]) # heart
## [1] "h28" "h30" "h32" "h41" "h18" "h42"
unique(gset_1@phenoData@data[["cell_line:ch1"]]) # cell lines
## [1] "H9"    "MLC2V"
unique(gset_1@phenoData@data[["day:ch1"]]) # day of differentiation
## [1] "0"  "2"  "4"  "10" "15" "20" "40" "60" "6"
gset_1@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_1@experimentData@other$last_update_date # last update date
## [1] "Aug 08 2023"
unique(gset_1@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
gset_2@experimentData@title # study title
## [1] "Bulk-RNA-sequencing from chamber specific heart tissue and isolated cardiomyocytes"
gset_2@experimentData@abstract # study abstract
## [1] "Healthy hearts from warm cadavers not usable for heart transplants were identified and tissues were isolated from these. From ventricle tissues, cardiomyocytes were further isolated. Bulk-RNA-sequencing was performed."
gset_2@experimentData@other$overall_design # summary of experiment
## [1] "Bulk-RNA-sequencing from 3 different donors (tissue) and 3 other donors (isolated CMs)"
unique(gset_2@phenoData@data[["extract_protocol_ch1.1"]]) # platform
## [1] "KAPA mRNA (polyA) HyperPrep"
unique(gset_2@phenoData@data[["chamber:ch1"]]) # cell lines
## [1] "LV" "RV" "LA" "RA"
unique(gset_2@phenoData@data[["patient:ch1"]]) # patient
## [1] "P1" "P2" "P3" "P4" "P5" "P6"
unique(gset_2@phenoData@data[["tissue:ch1"]]) # tissue
## [1] "Ventricle Isolated Cells" "Heart tissue"
gset_2@experimentData@other$platform_id # GPL id
## [1] "GPL20301"
gset_2@experimentData@other$last_update_date # last update date
## [1] "Apr 26 2023"
unique(gset_2@phenoData@data$organism_ch1) # organism
## [1] "Homo sapiens"
sample_info_1 <- gset_1@phenoData@data[ , (ncol(gset_1@phenoData@data)-2):ncol(gset_1@phenoData@data)]
colnames(sample_info_1) <- gsub(":ch1", "", colnames(sample_info_1)) 
knitr::kable(sample_info_1, format = "pipe", caption = "<b>Table 1:</b> Sample information for Cardiomyocytes Derived from Stem Cells")
Table 1: Sample information for Cardiomyocytes Derived from Stem Cells
cell_line day experiment number
GSM6170585 H9 0 h28
GSM6170586 H9 2 h28
GSM6170587 H9 4 h28
GSM6170588 H9 10 h28
GSM6170589 H9 15 h28
GSM6170590 H9 20 h28
GSM6170591 H9 40 h28
GSM6170592 H9 60 h28
GSM6170593 H9 0 h30
GSM6170594 H9 2 h30
GSM6170595 H9 4 h30
GSM6170596 H9 6 h30
GSM6170597 H9 10 h30
GSM6170598 H9 15 h30
GSM6170599 H9 20 h30
GSM6170600 H9 40 h30
GSM6170601 H9 60 h30
GSM6170602 H9 0 h32
GSM6170603 H9 2 h32
GSM6170604 H9 4 h32
GSM6170605 H9 6 h32
GSM6170606 H9 15 h32
GSM6170607 H9 20 h32
GSM6170608 H9 40 h32
GSM6170609 H9 60 h32
GSM6170610 H9 0 h41
GSM6170611 H9 2 h41
GSM6170612 H9 4 h41
GSM6170613 H9 6 h41
GSM6170614 H9 10 h41
GSM6170615 H9 15 h41
GSM6170616 H9 20 h41
GSM6170617 H9 40 h41
GSM6170618 MLC2V 0 h18
GSM6170619 MLC2V 2 h18
GSM6170620 MLC2V 4 h18
GSM6170621 MLC2V 6 h18
GSM6170622 MLC2V 10 h18
GSM6170623 MLC2V 15 h18
GSM6170624 MLC2V 20 h18
GSM6170625 MLC2V 40 h18
GSM6170626 MLC2V 60 h18
GSM6170627 MLC2V 0 h42
GSM6170628 MLC2V 2 h42
GSM6170629 MLC2V 4 h42
GSM6170630 MLC2V 6 h42
GSM6170631 MLC2V 10 h42
GSM6170632 MLC2V 15 h42
GSM6170633 MLC2V 20 h42
GSM6170634 MLC2V 40 h42
GSM6170635 MLC2V 60 h42
sample_info_2 <- gset_2@phenoData@data[ , (ncol(gset_2@phenoData@data)-2):ncol(gset_2@phenoData@data)]
colnames(sample_info_2) <- gsub(":ch1", "", colnames(sample_info_2)) 
knitr::kable(sample_info_2, format = "pipe", caption = "<b>Table 2:</b> Sample information for Patient-Derived Samples")
Table 2: Sample information for Patient-Derived Samples
chamber patient tissue
GSM6199297 LV P1 Ventricle Isolated Cells
GSM6199298 RV P1 Ventricle Isolated Cells
GSM6199299 LV P2 Ventricle Isolated Cells
GSM6199300 RV P2 Ventricle Isolated Cells
GSM6199301 LV P3 Ventricle Isolated Cells
GSM6199302 RV P3 Ventricle Isolated Cells
GSM6199303 LA P4 Heart tissue
GSM6199304 LV P4 Heart tissue
GSM6199305 RA P4 Heart tissue
GSM6199306 RV P4 Heart tissue
GSM6199307 LA P5 Heart tissue
GSM6199308 LV P5 Heart tissue
GSM6199309 RA P5 Heart tissue
GSM6199310 RV P5 Heart tissue
GSM6199311 LA P6 Heart tissue
GSM6199312 LV P6 Heart tissue
GSM6199313 RA P6 Heart tissue
GSM6199314 RV P6 Heart tissue

Data-Cleaning

To clean our data, we will do two things: 1. Filter out zero-expressors 2. Perform HGNC re-mapping 3. Normalize the data

Filtering Out Zero-Expressors

From A1, we know that there are genes that minimally express throughout different conditions, and as a result, we will be filtering those out.

# Pick out all rows that have zeroes across all conditions
rows_to_keep <- apply(counts_data, 1, function(row) any(row != 0))

# Create new matrix containing only the rows that aren't all zeroes
counts_data_zero_filtered <- counts_data[rows_to_keep, ]

Data Normalization

In line with the experimenters’ analyses, we will normalize the data together

First, let’s perform CPM filtering to remove low expressors.

# Minimum number of samples for the hPSC run
min_num_samples <- 2
counts_data_zero_filtered_matrix <- as.matrix(counts_data_zero_filtered)

# Get rid of low counts
# We add 0.1 for numerical stability and to prevent NA values when evaluating logs
keep = rowSums(log2(cpm(counts_data_zero_filtered_matrix+0.1)) > 1) > min_num_samples
filtered_counts_matrix = counts_data_zero_filtered_matrix[keep,]

After, we will use TMM from edgeR (Robinson, McCarthy, and Smyth 2010) to perform our normalization steps.

# We will make our DGEList with the filtered count matrices
d <- DGEList(counts=filtered_counts_matrix)

# Calculate normalization factors
d <- calcNormFactors(d)

# Apply normalization
normalized_counts <- cpm(d)

Normalization is now done, and we can move onto HGNC mapping.

HGNC Mapping

The protocol followed for HGNC mapping is nearly identical to the first assignment. First, we will add a column to our normalized with the appropriate NCBI gene ID. While we do have the annotation table, it is also true that we were specified to still manually perform this step. We will download a copy of the HGNC dataset. (Tweedie et al. 2020) The dataset is confirmed to be up-to-date with the date of download.

# Download the HGNC file if it doesn't already exist in the cwd
if (!file.exists("hgnc_genes.RData")) {
  hgnc_genes <- import_hgnc_dataset(file=latest_archive_url())
  saveRDS(hgnc_genes, "hgnc_genes.RData")
} else {
  hgnc_genes <- readRDS("hgnc_genes.RData")
}
hgnc_mapping <- hgnc_genes[, c('symbol', 'entrez_id')]

Now that we have a mapping, we can apply this to the table.

# Convert the normalized counts into a dataframe
normalized_counts_df <- as.data.frame(normalized_counts)

# Add rownames
normalized_counts_df$NCBI_gene_id <- rownames(normalized_counts_df)

# Map the entrez_id to the NCBI_gnes
labelled_counts_data <- merge(normalized_counts_df, hgnc_mapping, by.x = "NCBI_gene_id", by.y = 'entrez_id', all.x = TRUE)
labelled_counts_data <- labelled_counts_data[,c(1, ncol(labelled_counts_data), 3:ncol(labelled_counts_data)-1)]

knitr::kable(labelled_counts_data[c(1:10), c(1,2)], format = "pipe", caption = "<b>Table 3:</b> Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols")
Table 3: Table Matching First 10 NCBI Gene IDs to the Approved HGNC Symbols
NCBI_gene_id symbol
1 A1BG
100 ADA
1000 CDH2
10000 AKT3
100008587 RNA5-8SN5
100008588 RNA18SN5
100008589 RNA28SN5
100009676 ZBTB11-AS1
10001 MED6
10003 NAALAD2
labelled_counts_data <- labelled_counts_data[, -c(1)]

Now, we must deal with NA and empty values, along with many-to-one mappings.

# Remove all NA labels
no_na_labelled_counts_data <- labelled_counts_data[!is.na(labelled_counts_data$symbol), ]

# Remove all empty strings symbols
no_emp_labelled_counts_data <- no_na_labelled_counts_data[no_na_labelled_counts_data$symbol != '', ]

rownames(no_emp_labelled_counts_data) <- no_emp_labelled_counts_data[, 1]
final_normalized_data <- no_emp_labelled_counts_data[, -c(1)]

Overview Statistics

Now that we have our data, we can look at a few counting statistics to gain insight into what we’re looking at.

# Create a list of overview stats
overview_stats <- list()

# Do it for each sample
for (sample_name in colnames(final_normalized_data)) {
  sample_data <- final_normalized_data[, sample_name]
  
  total_counts <- sum(sample_data)
  mean_counts <- mean(sample_data)
  median_counts <- median(sample_data)
  sd_counts <- sd(sample_data)
  var_counts <- var(sample_data)
  overview_stats[[sample_name]] <- c(total_counts, mean_counts, median_counts, sd_counts, var_counts)
}

overview_stats_df <- do.call(rbind, overview_stats)
rownames(overview_stats_df) <- names(overview_stats)
colnames(overview_stats_df) <- c("Total Counts", "Mean Counts", "Median Counts", "Counts Standard Deviation", "Counts Variation")

knitr::kable(overview_stats_df[, ], format = "pipe", caption = "<b>Table 4:</b> Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.")
Table 4: Overview Statistics of Normalized Samples. The total counts are somewhat similar, but the mean counts are lower in the hPSC-derived samples. Mean counts are somewhat similar in both, but the median counts are greater in the hPSCs. The standard deviation and variation in the hPSCs is far lesser than that of the patient samples.
Total Counts Mean Counts Median Counts Counts Standard Deviation Counts Variation
Heart 28 Day 0 H9 1021771.9 59.98426 12.24620 431.8453 186490.34
Heart 28 Day 2 H9 786599.4 46.17819 13.63105 154.4348 23850.12
Heart 28 Day 4 H9 716839.2 42.08285 13.82826 125.4171 15729.45
Heart 28 Day 10 H9 793482.4 46.58227 13.73218 236.6323 55994.84
Heart 28 Day 15 H9 745914.9 43.78977 14.72937 195.7736 38327.31
Heart 28 Day 20 H9 993073.6 58.29949 14.57508 340.7407 116104.22
Heart 28 Day 40 H9 1023479.0 60.08448 14.23242 342.5525 117342.18
Heart 28 Day 60 H9 829562.9 48.70042 14.76538 223.1736 49806.45
Heart 30 Day 0 H9 863916.5 50.71719 12.75642 218.8989 47916.72
Heart 30 Day 2 H9 771994.0 45.32077 13.16358 169.0652 28583.03
Heart 30 Day 4 H9 699609.0 41.07133 13.84685 136.2720 18570.05
Heart 30 Day 6 H9 786225.2 46.15623 14.09130 184.6986 34113.57
Heart 30 Day 10 H9 761009.1 44.67589 14.40243 189.8644 36048.49
Heart 30 Day 15 H9 908915.2 53.35888 13.90010 360.0973 129670.08
Heart 30 Day 20 H9 887759.9 52.11694 13.69645 273.7729 74951.58
Heart 30 Day 40 H9 1010387.1 59.31591 13.56070 489.7979 239901.95
Heart 30 Day 60 H9 920475.0 54.03751 14.47831 373.9279 139822.08
Heart 32 Day 0 H9 779406.7 45.75594 13.18829 149.3232 22297.42
Heart 32 Day 2 H9 770684.2 45.24387 13.37835 153.5595 23580.52
Heart 32 Day 4 H9 702719.5 41.25393 13.64478 129.9497 16886.93
Heart 32 Day 6 H9 772659.8 45.35985 14.03981 170.6316 29115.14
Heart 32 Day 15 H9 844624.9 49.58465 13.86116 271.0413 73463.40
Heart 32 Day 20 H9 831683.9 48.82493 14.49759 253.6298 64328.10
Heart 32 Day 40 H9 805256.4 47.27347 15.13632 206.4093 42604.78
Heart 32 Day 60 H9 1339814.7 78.65532 12.68147 1102.2826 1215026.94
Heart 41 Day 0 H9 747252.1 43.86827 13.25882 164.7467 27141.46
Heart 41 Day 2 H9 771285.9 45.27920 13.46844 181.9923 33121.19
Heart 41 Day 4 H9 727104.3 42.68547 13.48812 172.0213 29591.32
Heart 41 Day 6 H9 744472.3 43.70508 14.23815 176.5771 31179.46
Heart 41 Day 10 H9 800151.2 46.97377 14.00636 311.0775 96769.22
Heart 41 Day 15 H9 927299.3 54.43814 13.63959 425.1034 180712.93
Heart 41 Day 20 H9 809279.0 47.50963 14.63262 230.8466 53290.15
Heart 41 Day 40 H9 826540.3 48.52297 15.18484 211.2894 44643.22
Heart 18 Day 0 MLC2V 802319.7 47.10108 13.25450 158.2561 25045.00
Heart 18 Day 2 MLC2V 764355.6 44.87235 13.46746 169.3843 28691.04
Heart 18 Day 4 MLC2V 707001.2 41.50530 14.09808 150.1960 22558.85
Heart 18 Day 6 MLC2V 769018.6 45.14609 14.25261 169.0281 28570.48
Heart 18 Day 10 MLC2V 791369.4 46.45822 14.52728 219.7194 48276.59
Heart 18 Day 15 MLC2V 791232.2 46.45017 14.62889 233.2285 54395.55
Heart 18 Day 20 MLC2V 899519.9 52.80732 14.82049 244.6640 59860.47
Heart 18 Day 40 MLC2V 935508.8 54.92009 14.67489 361.6569 130795.68
Heart 18 Day 60 MLC2V 933355.7 54.79369 14.68259 421.3648 177548.34
Heart 42 Day 0 MLC2V 816656.5 47.94273 13.07401 196.8750 38759.78
Heart 42 Day 2 MLC2V 789418.7 46.34370 13.20376 183.5629 33695.34
Heart 42 Day 4 MLC2V 730009.3 42.85601 13.51457 162.4742 26397.86
Heart 42 Day 6 MLC2V 779092.2 45.73748 13.98386 184.5081 34043.25
Heart 42 Day 10 MLC2V 860421.6 50.51201 14.31508 232.9254 54254.26
Heart 42 Day 15 MLC2V 775917.9 45.55113 14.74511 236.1324 55758.50
Heart 42 Day 20 MLC2V 849629.4 49.87844 14.70218 252.2778 63644.10
Heart 42 Day 40 MLC2V 931343.7 54.67557 14.76092 407.5923 166131.48
Heart 42 Day 60 MLC2V 972308.9 57.08048 15.28911 496.6643 246675.41
Patient 1 Left Ventricle Isolated Cells 2097613.3 123.14273 12.71544 2054.7924 4222171.89
Patient 1 Right Ventricle Isolated cells 1801820.6 105.77789 12.64497 1652.3177 2730153.91
Patient 2 Left Ventricle Isolated Cells 2448968.4 143.76943 11.90376 2919.5824 8523961.52
Patient 2 Right Ventricle Isolated Cells 2338963.9 137.31149 11.48868 2609.5704 6809857.69
Patient 3 Left Ventricle Isolated Cells 1955785.5 114.81657 12.28001 1785.9598 3189652.30
Patient 3 Right Ventricle Isolated Cells 1841091.5 108.08333 12.63626 1779.0770 3165115.09
Patient 4 Left Atria Tissue 1289436.8 75.69783 15.15308 1052.6384 1108047.51
Patient 4 Left Ventricle Tissue 1993945.7 117.05681 14.52426 2527.4069 6387785.88
Patient 4 Right Atria Tissue 1289380.0 75.69449 14.74821 1068.7044 1142129.05
Patient 4 Right Ventricle Tissue 1729378.7 101.52511 13.97466 1684.3179 2836926.68
Patient 5 Left Atria Tissue 1374196.6 80.67374 14.72154 1212.6323 1470477.14
Patient 5 Left Ventricle Tissue 1610446.1 94.54304 14.15593 1607.2241 2583169.31
Patient 5 Right Atria Tissue 1383940.8 81.24579 15.08133 1388.4255 1927725.46
Patient 5 Right Ventricle Tissue 1649782.1 96.85230 14.02391 1703.4872 2901868.80
Patient 6 Left Atria Tissue 1432786.4 84.11333 14.46312 1383.8852 1915138.25
Patient 6 Left Ventricle Tissue 1393964.4 81.83424 14.18233 1300.1378 1690358.21
Patient 6 Right Atria Tissue 1273212.8 74.74538 14.75168 1055.1550 1113352.02
Patient 6 Right Ventricle Tissue 1460262.8 85.72636 14.21627 1364.0782 1860709.32

MDS Plots

We will now use MDS plots, which involve using dimensional reduction techniques to observe clustering of our samples. (Ritchie et al. 2015)

par(xpd=T, mar=par()$mar+c(1.5,0,0,3))

# Manually specify the order of the conditions
ordered_conditions <- c("0", "2", "4", "6", "10", "15", "20", "40", "60", "LA", "LV", "RA", "RV")

# Create a factor with the specified order
conditions_factor <- factor(c(sample_info_1$day, sample_info_2$chamber), levels = ordered_conditions)

# Generate a color palette with the same number of colors as the number of conditions
n_colours <- length(ordered_conditions)
palette <- rainbow(n_colours)

# Plot the MDS with the ordered conditions and corresponding colors
limma::plotMDS(d, pch = 1, col = palette[conditions_factor])

# Add a title and legend
mtext("Figure 1: MDS Showing Clustering of Patient Samples\nFrom Temporally Clustered hPSC-Derived CMs", side=1, line=5)
legend("bottomright", 
       legend = ordered_conditions,
       inset = c(-0.2, 0),
       pch = c(1), col = palette, title = "Day/Chamber",
       bty = 'n', cex = 0.75)

par(mar=c(5, 4, 4, 2) + 0.1)

We can see that there is strong clustering of all the patient-derived samples, and there’s also temporal clustering of all the stem cell-derived CMs. This is evidenced by the fact the from left to right, we see a ‘continuous’ clustering of cardiomyocytes from d60 to d0. We see that the patient samples all cluster together quite strongly, but further away in the firsta and second dimensions. This can likely be explained due to higher variance and low median counts in the patient data, but the fact that the cells came from warm cadavers, and the cell niche The range for the MDS Leading logFC dimensions are also quite high, which could be explained due to having two different types of data here.

Differential Gene Expression

We will now prefer differential gene expression analyses. Based on the results of the previously generated MDS plots, I think that important analyses would be: * Pairwise comparisons between each time point of the hPSC to cardiomyocyte differentiation + Specifically a comparison between day 0 and day 60 of the stem cell experiments - Comparison of the day 60 * hPSC-cardiomyocyte to the rest of the patient heart cells

Determining p-Values

We will determine the p-values of each of the experiment using edgeR. (Robinson, McCarthy, and Smyth 2010)

# 1. Prepare the data

# Truncate the sample names
converted_names <- gsub("Heart \\d+ Day (\\d+) .*", "day \\1", colnames(final_normalized_data))
converted_names <- gsub("Patient \\d+ (Left|Right) (Ventricle|Atria) .*", "\\1 \\2", converted_names)
converted_names <- gsub(" ", "", converted_names)
# Create a data frame with sample information
samples <- data.frame(converted_names)

# Convert the count data to a DGEList object
dge <- DGEList(counts = final_normalized_data)

# 2. Create the design matrix
# Create the design matrix with only 'day' as a factor
design <- model.matrix(~0 + factor(samples$converted_names))

# Rename the columns to be more interpretable
colnames(design) <- gsub("factor\\(samples\\$converted_names\\)", "", colnames(design))

dge <- estimateDisp(dge, design)

# 4. Fit the model
fit <- glmQLFit(dge, design)

Now, we will create the contrast matrices for comparisons.

# 5. Create contrast matrices
twoVS0con <- makeContrasts(
    day2vsday0 = day2 - day0,
    levels = design
)
fourVS2con <- makeContrasts(
    day4vsday2 = day4 - day2,
    levels = design
)
sixVS4con <- makeContrasts(
    day6vsday4 = day6 - day4,
    levels = design
)
tenVS6con <- makeContrasts(
    day10vsday6 = day10 - day6,
    levels = design
)
fifteenVS10con <- makeContrasts(
    day15vsday10 = day15 - day10,
    levels = design
)
twentyVS15con <- makeContrasts(
    day20vsday15 = day20 - day15,
    levels = design
)
fortyVS20con <- makeContrasts(
    day40vsday20 = day40 - day20,
    levels = design
)
sixtyVS40con <- makeContrasts(
    day60vsday40 = day60 - day40,
    levels = design
)
sixtyVS0con <- makeContrasts(
    day60vsday0 = day60 - day0,
    levels = design
)
lvVS60 <- makeContrasts(
    LVvsday0 = LeftVentricle - day60,
    levels = design
)
rvVS60 <- makeContrasts(
    RVvsday60 = RightVentricle - day60,
    levels = design
)
laVS60 <- makeContrasts(
    LAvsday60 = LeftAtria - day60,
    levels = design
)
raVS60 <- makeContrasts(
    RAvsday60 = RightAtria - day60,
    levels = design
)
lvVSrv <- makeContrasts(
    lvvsrv = LeftVentricle - RightVentricle,
    levels = design
)

Now, we can find differential expression between conditions

# 6. Perform the tests
twoVS0_lrt <- glmQLFTest(fit, contrast = twoVS0con)
fourVS2_lrt <- glmQLFTest(fit, contrast = fourVS2con)
sixVS4_lrt <- glmQLFTest(fit, contrast = sixVS4con)
tenVS6_lrt <- glmQLFTest(fit, contrast = tenVS6con)
fifteenVS10_lrt <- glmQLFTest(fit, contrast = fifteenVS10con)
twentyVS15_lrt <- glmQLFTest(fit, contrast = twentyVS15con)
fortyVS20_lrt <- glmQLFTest(fit, contrast = fortyVS20con)
sixtyVS40_lrt <- glmQLFTest(fit, contrast = sixtyVS40con)
sixtyVS0_lrt <- glmQLFTest(fit, contrast = sixtyVS0con)
lvVS60_lrt <- glmQLFTest(fit, contrast = lvVS60)
rvVS60_lrt <- glmQLFTest(fit, contrast = rvVS60)
laVS60_lrt <- glmQLFTest(fit, contrast = laVS60)
raVS60_lrt <- glmQLFTest(fit, contrast = raVS60)
lvVSrv_lrt <- glmQLFTest(fit, contrast = lvVSrv)

# 7. Extract results
results_twoVS0 <- topTags(twoVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fourVS2 <- topTags(fourVS2_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixVS4 <- topTags(sixVS4_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_tenVS6 <- topTags(tenVS6_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fifteenVS10 <- topTags(fifteenVS10_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_twentyVS15 <- topTags(twentyVS15_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_fortyVS20 <- topTags(fortyVS20_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS40 <- topTags(sixtyVS40_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_sixtyVS0 <- topTags(sixtyVS0_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVS60 <- topTags(lvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_rvVS60 <- topTags(rvVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_laVS60 <- topTags(laVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_raVS60 <- topTags(raVS60_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")
results_lvVSrv <- topTags(lvVSrv_lrt, sort.by = "PValue", n = nrow(final_normalized_data), adjust.method = "BH")

I chose to use a cut-off of 0.01, in-line with the original study, and the limit the amount of genes that show up as significant, as there is a lot of temporal variation in stem cell differentiation protocols. In terms of the multiple hypothesis correction, I chose to use Benjamini-Hochberg for its robustness and that it seems quite standard in such analyses.

We will now present summary data in a table for each comparison.

summary_table <- data.frame(
  Comparison = c("Day 2 VS Day 0",
                 "Day 4 VS Day 2",
                 "Day 6 VS Day 4",
                 "Day 10 VS Day 6",
                 "Day 15 VS Day 10",
                 "Day 20 VS Day 15",
                 "Day 40 VS Day 20",
                 "Day 60 VS Day 40",
                 "Day 60 VS Day 0",
                 "LV VS Day 60",
                 "RV VS Day 60",
                 "LA VS Day 60",
                 "RA VS Day 60",
                 "LV VS RV"
                 ),
  Significant = c(length(which(results_twoVS0$table$PValue < 0.01)),
                  length(which(results_fourVS2$table$PValue < 0.01)),
                  length(which(results_sixVS4$table$PValue < 0.01)),
                  length(which(results_tenVS6$table$PValue < 0.01)),
                  length(which(results_fifteenVS10$table$Value < 0.01)),
                  length(which(results_twentyVS15$table$PValue < 0.01)),
                  length(which(results_fortyVS20$table$PValue < 0.01)),
                  length(which(results_sixtyVS40$table$PValue < 0.01)),
                  length(which(results_sixtyVS0$table$PValue < 0.01)),
                  length(which(results_lvVS60$table$PValue < 0.01)),
                  length(which(results_rvVS60$table$PValue < 0.01)),
                  length(which(results_laVS60$table$PValue < 0.01)),
                  length(which(results_raVS60$table$PValue < 0.01)),
                  length(which(results_lvVSrv$table$PValue < 0.01))
                  ),
  Corrected = c(length(which(results_twoVS0$table$FDR < 0.01)),
                length(which(results_fourVS2$table$FDR < 0.01)),
                length(which(results_sixVS4$table$FDR < 0.01)),
                length(which(results_tenVS6$table$FDR < 0.01)),
                length(which(results_fifteenVS10$table$FDR < 0.01)),
                length(which(results_twentyVS15$table$FDR < 0.01)),
                length(which(results_fortyVS20$table$FDR < 0.01)),
                length(which(results_sixtyVS40$table$FDR < 0.01)),
                length(which(results_sixtyVS0$table$FDR < 0.01)),
                length(which(results_lvVS60$table$FDR < 0.01)),
                length(which(results_rvVS60$table$FDR < 0.01)),
                length(which(results_laVS60$table$FDR < 0.01)),
                length(which(results_raVS60$table$FDR < 0.01)),
                length(which(results_lvVSrv$table$FDR < 0.01))
                )
)

kable(summary_table, caption = "Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there's a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they're in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed. ")
Table 5: The number of genes that show the number of significantly differentially expressed genes before and after FDR correction. For the temporal analysis, the first two comparisons hover around 1800 genes, the next two are at around 900, and then there’s a steep drop off at the day 15 vs 10 and day 20 vs 15 comparisons. This picks back up. This picks back up intil days 60 to 40, showing minimal change, suggesting that they’re in late stage maturation. Between day 60 and patient samples, many genes, all above 6000, are significantly differentially expressed. Between ventricles, no change is observed.
Comparison Significant Corrected
Day 2 VS Day 0 3230 1895
Day 4 VS Day 2 3006 1774
Day 6 VS Day 4 2056 866
Day 10 VS Day 6 2301 1006
Day 15 VS Day 10 0 64
Day 20 VS Day 15 1641 416
Day 40 VS Day 20 2501 1163
Day 60 VS Day 40 464 15
Day 60 VS Day 0 9281 8768
LV VS Day 60 10698 10290
RV VS Day 60 10423 9999
LA VS Day 60 7281 6360
RA VS Day 60 7082 6126
LV VS RV 17 0

As expected, the comparison with the greatest number of differentially expressed genes was between day 60 and day 0, which compares an hPSC to a left ventricular cardiomyocyte derived from stem cells. Interestingly, we see a very small amount of differentially expressed genes for day 15 vs day 10 and day 60 and day 40. In terms of the patient sample comparisons, we see that all of them have a great number of genes with a significant values of differential expression, which indicates that even the mature hPSC-derived CM is very far from the patient by transcriptome. This could be explained due to the niche of the cell–cardiomyocytes in vivo are constantly beating, working to pump blood, and are also surrounded by other tissue types which secrete signals to regulate expression of different transcripts. In a dish, this isn’t the case, which could explain this phenomenon. We can investigate this more closely once we see which genes come out as top hits.

We will save our results in a list for ease of downstream analysis.

# List of all the results objects
results_list <- list(
  results_twoVS0 = results_twoVS0,
  results_fourVS2 = results_fourVS2,
  results_sixVS4 = results_sixVS4,
  results_tenVS6 = results_tenVS6,
  results_fifteenVS10 = results_fifteenVS10,
  results_twentyVS15 = results_twentyVS15,
  results_fortyVS20 = results_fortyVS20,
  results_sixtyVS40 = results_sixtyVS40,
  results_sixtyVS0 = results_sixtyVS0,
  results_lvVS60 = results_lvVS60,
  results_rvVS60 = results_rvVS60,
  results_laVS60 = results_laVS60,
  results_raVS60 = results_raVS60,
  results_lvVSrv = results_lvVSrv
)

# Titles for each plot
titles <- c(
  "Day 2 vs Day 0",
  "Day 4 vs Day 2",
  "Day 6 vs Day 4",
  "Day 10 vs Day 6",
  "Day 15 vs Day 10",
  "Day 20 vs Day 15",
  "Day 40 vs Day 20",
  "Day 60 vs Day 40",
  "Day 60 vs Day 0",
  "LV vs Day 60",
  "RV vs Day 60",
  "LA vs Day 60",
  "RA vs Day 60",
  "LV vs RV"
)

Volcano Plots

We will use volcano plots to show differentially expressed genes. To do this efficiently, we will make a function for making volcano plot. To determine top hits, we will filter by having an FDR of less that 0.01, and having an absolute logFC value of 2. This is in line with the paper’s protocol.

create_volcano_plot <- function(data, title, logFC_threshold = 2, FDR_threshold = 0.01, top_n = 5) {
  # Add gene names as a column
  data$Gene <- rownames(data)

  # Add a column to indicate the direction of regulation
  data$direction <- with(data, ifelse(logFC > logFC_threshold & FDR < FDR_threshold, "Upregulated",
                                      ifelse(logFC < -logFC_threshold & FDR < FDR_threshold, "Downregulated", "Neutral")))
  
  # Select top genes based on FDR and absolute logFC
  top_genes <- data %>%
    filter(direction %in% c("Upregulated", "Downregulated")) %>%
    arrange(FDR, abs(logFC)) %>%
    group_by(direction) %>%
    slice_head(n = top_n) %>%
    ungroup()
  
  # Create the volcano plot
  volcano_plot <- ggplot(data, aes(x = logFC, y = -log10(FDR), color = direction)) +  
    geom_point(alpha = 0.5) + 
    theme_minimal() +
    labs(title = title, x = "Log Fold Change", y = "-log10(FDR)") +
    scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Neutral" = "grey")) +
    theme(legend.title = element_blank()) +
    geom_vline(xintercept = c(-logFC_threshold, logFC_threshold), linetype = "dashed", color = "black") +
    geom_hline(yintercept = -log10(FDR_threshold), linetype = "dashed", color = "black") +
    geom_text_repel(data = top_genes, aes(label = Gene), size = 3, nudge_y = 1) 
  
  return(volcano_plot)
}

Now, let’s plot

# Create a plot for each comparison
plots <- lapply(seq_along(results_list), function(i) {
  create_volcano_plot(results_list[[i]]$table, titles[i])
})

# Display the plots
for (i in seq_along(plots)) {
  print(plots[[i]])
}

Figure 2: Volcano plots for all comparisons. All are displayed for breadth of analysis. The top hits are also displayed. The interpretation for these results is discussed in the table heading for Table 5 and after.

Filtering for Top Hits

We will now filter with the same parameters used to filter in the volcano plots.

# Filter each comparison based on the specified criteria
filtered_results_list <- lapply(results_list, function(result) {
  result$table %>%
    filter(FDR < 0.01, abs(logFC) > 2)
})

# Names for the filtered results
names(filtered_results_list) <- names(results_list)

filtered_results_display_df <- data.frame(
  Num_Genes = sapply(filtered_results_list, nrow)
)

kable(filtered_results_display_df, caption = "Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.")
Table 6: Number of genes meeting criteria for each comparison. Notice that because of the logFC conition, all values are necessarily equal to or smaller than the values in Table 5.
Num_Genes
results_twoVS0 563
results_fourVS2 616
results_sixVS4 368
results_tenVS6 251
results_fifteenVS10 20
results_twentyVS15 132
results_fortyVS20 323
results_sixtyVS40 9
results_sixtyVS0 3720
results_lvVS60 3836
results_rvVS60 3762
results_laVS60 2369
results_raVS60 2289
results_lvVSrv 0

Filtering for All and Unique Differentially Expressed Genes

# Extract gene names from each filtered result and combine them into a single vector
union_gene_names <- unlist(lapply(filtered_results_list, function(filtered_result) {
  rownames(filtered_result)
}))

# Remove duplicate gene names
all_genes <- unique(union_gene_names)

We will also need single genes for downstream analysis.

# Get only the unique genes for downstream analysis
gene_counts <- table(union_gene_names)
unique_genes <- names(gene_counts[gene_counts == 1])

Heat Map

We will now present our analysis, and specifically clustering of samples using a heatmap by using the Complex HeatMap library. (Gu 2022)

# Subset the normalized count matrix to keep only the genes of interest, ensure no NAs
heatmap_data <- final_normalized_data[all_genes, , drop = FALSE]
heatmap_data <- heatmap_data[complete.cases(heatmap_data), ]

# Scale the expression data
scaled_heatmap_data <- t(scale(t(heatmap_data)))

# Order the converted_names
ordered_converted_names <- unique(converted_names)
day6element <- ordered_converted_names[9]
ordered_converted_names <- ordered_converted_names[-9]
ordered_converted_names <- append(ordered_converted_names, day6element, after=3)

# Generate a color vector for conditions, using rainbow and setting the names attribute
condition_colors <- setNames(rainbow(length(ordered_converted_names)), ordered_converted_names)

# Create annotations for the heatmap
annotations_df <- data.frame(Condition = factor(converted_names, levels = ordered_converted_names))

# Create annotations for the heatmap
heatmap_annotations <- HeatmapAnnotation(Condition = annotations_df$Condition, col = list(Condition = condition_colors), show_legend = TRUE)

# Generate the heatmap
heatmap <- Heatmap(
  as.matrix(scaled_heatmap_data),
  top_annotation = heatmap_annotations,
  cluster_rows = TRUE,
  cluster_columns = TRUE, # Turn off column clustering to maintain the order of days and conditions
  col = colorRamp2(c(-2, 0, 4), c("darkgreen", "white", "darkmagenta")), # Adjust colors as needed
  show_column_names = FALSE, # Turn on to see the order of columns
  show_row_names = FALSE,
  show_heatmap_legend = TRUE,
  column_title = "Figure 3: Gene Expression of all Significantly Differentially Expressed Genes"
)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` argument by explicitly setting
## TRUE/FALSE to it.
## 
## Set `ht_opt$message = FALSE` to turn off this message.
# Draw the heatmap
draw(heatmap)

The conditions cluster somewhat well, but out of order temporally. This may be because many genes may be up and down-regulated during differentiation, and as a result, a gene coming on-off-on or vice-versa may result in different timepoint clustering together exclusively through transcriptomes and not temporally. Interestingly, the patient samples cluster well together as expected, but they are adjacent to day 6, which is unusual. My best explanation may be because of variance, but I will try to better cluster these using unique genes for each condition instead. By doing this, we get rid of genes that are significant more than once. This way, the genes act as better ‘signatures’ of each condition.

# Subset the normalized count matrix to keep only the genes of interest, ensure no NAs
unique_heatmap_data <- final_normalized_data[unique_genes, , drop = FALSE]
unique_heatmap_data <- unique_heatmap_data[complete.cases(unique_heatmap_data), ]

# Scale the expression data
unqiue_scaled_heatmap_data <- t(scale(t(unique_heatmap_data)))

# Generate the heatmap
unique_heatmap <- Heatmap(
  as.matrix(unqiue_scaled_heatmap_data),
  top_annotation = heatmap_annotations,
  cluster_rows = TRUE,
  cluster_columns = TRUE, # Turn off column clustering to maintain the order of days and conditions
  col = colorRamp2(c(-2, 0, 4), c("darkgreen", "white", "darkmagenta")), # Adjust colors as needed
  show_column_names = FALSE, # Turn on to see the order of columns
  show_row_names = FALSE,
  show_heatmap_legend = TRUE,
  column_title = "Figure 4: Gene Expression of Unique Significantly Differentially Expressed Genes"
)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` argument by explicitly setting
## TRUE/FALSE to it.
## 
## Set `ht_opt$message = FALSE` to turn off this message.
# Draw the heatmap
draw(unique_heatmap)

This time, we better clustering, and ‘colour-blocking’ in the matrix itself. Once again, the patient samples are sandwiched between days 6 and 10. This may be because the normalization technique used may not have been as appropriate for such analysis, and significant noise may be in samples. In the future, I may incorporate a library like noisyR (Moutsopoulos et al. 2021)

Over-Representation Analysis

Defining Up and Down-Regulated Gene Sets

Here, we will define up and down-regulated gene sets. The goal of this function block was to create a general overall list of genes that were up and down-regulated for each condition, and to create a pooled one as well. In terms of significance values, we’ll use a cut-off of FDR < 0.05 and the magnitude of logFC being 0. These are the same parameters the authors used in their analysis.

ora_filtered_results_list <- lapply(results_list, function(result) {
  result$table %>%
    filter(FDR < 0.05)
})

up_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
  result %>%
    filter(logFC > 0)
})

up_regulated_results_list <- lapply(up_regulated_results_list, function(result) {
  rownames(result)
})

down_regulated_results_list <- lapply(ora_filtered_results_list, function(result) {
  result %>%
    filter(logFC < 0)
})

down_regulated_results_list <- lapply(down_regulated_results_list, function(result) {
  rownames(result)
})

all_up_regulated_genes <- unique(unlist(up_regulated_results_list))
all_down_regulated_genes <- unique(unlist(down_regulated_results_list))

Setting up the g:Profiler

Let’s set up the g:Profiler for all of the aforementioned combinations (per condition, for significant, up and down-regulated genes). (Reimand et al. 2007) We will be using the g:Profiler since it’s a popular and robust tool for ORA, and it was also a technique learned in a previous homework assignment. The paper only uses the GO BP annotation dataset, and we will be doing the same. This dataset …

go_upreg <- gost(
  query = all_up_regulated_genes, 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

go_downreg <- gost(
  query = all_down_regulated_genes, 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

go_updownreg <- gost(
  query = list(Downregulated = all_down_regulated_genes, Upregulated = all_up_regulated_genes), 
  organism = "hsapiens", 
  sources = c("GO:BP")
  )

Analyzing All Up and Down-Regulated Genes, Pooled

Let’s look at the results of these:

numerical_go_summary_df <- data.frame(
  Analysis <- c("Up and Down-Regulated", "Up-Regulated", "Down-Regulated"),
  Pathways <- c(
    nrow(go_updownreg$result),
    nrow(go_upreg$result),
    nrow(go_downreg$result)
  )
)

kable(numerical_go_summary_df, col.names = c("Analysis", "Number of Pathways"), caption = "Table 7: The number of pathways that are up/down-regulated for each condition, and when they're pooled together. Notice that the the when analyzed together, the number of pathways are the sum of te individual conditions.")
Table 7: The number of pathways that are up/down-regulated for each condition, and when they’re pooled together. Notice that the the when analyzed together, the number of pathways are the sum of te individual conditions.
Analysis Number of Pathways
Up and Down-Regulated 3150
Up-Regulated 1604
Down-Regulated 1546

We can see that running the analysis together or separately still yields the same number of total pathways being recognized.

Let’s look at a few up-regulated terms.

We will filter pathways with greater than 1000 terms, since they could be ‘ubiquitous’ and yield false positives. Start with up-regulation.

filtered_go_upreg <- go_updownreg$result[
  which(go_updownreg$result$query == "Upregulated" & go_updownreg$result$term_size <= 1000 ),]

kable(filtered_go_upreg[1:10, c(11, 3, 1)],
      caption = "Table 8: Top 10 upregulated pathways. The BH values show up as zero since they are so small.",
      col.names = c("Pathway", "P Value (BH)","Direction")
      )
Table 8: Top 10 upregulated pathways. The BH values show up as zero since they are so small.
Pathway P Value (BH) Direction
1584 tube morphogenesis 0 Upregulated
1585 heart development 0 Upregulated
1594 vasculature development 0 Upregulated
1597 muscle structure development 0 Upregulated
1599 enzyme-linked receptor protein signaling pathway 0 Upregulated
1605 blood vessel development 0 Upregulated
1613 regulation of cell migration 0 Upregulated
1615 regulation of cell motility 0 Upregulated
1619 blood vessel morphogenesis 0 Upregulated
1622 circulatory system process 0 Upregulated

And now down-regulation.

filtered_go_downreg <- go_updownreg$result[
  which(go_updownreg$result$query == "Downregulated" & go_updownreg$result$term_size <= 1000 ),]

kable(filtered_go_downreg[1:10, c(11, 3, 1)],
      caption = "Table 9: Top 10 downregulated pathways. The BH values show up as zero since they are so small.",
      col.names = c("Pathway", "P Value (BH)","Direction")
      )
Table 9: Top 10 downregulated pathways. The BH values show up as zero since they are so small.
Pathway P Value (BH) Direction
71 mitotic cell cycle 0 Downregulated
73 DNA damage response 0 Downregulated
78 mitotic cell cycle process 0 Downregulated
93 cell morphogenesis 0 Downregulated
98 ncRNA metabolic process 0 Downregulated
101 DNA repair 0 Downregulated
102 intracellular protein transport 0 Downregulated
103 protein catabolic process 0 Downregulated
107 cell division 0 Downregulated
111 ribosome biogenesis 0 Downregulated

Analyzing for Pairwise Comparisons

We will extract up and down-regualted genes using the same conditions for all conditions.

# Initialize lists to hold the gene sets
upregulated_genesets <- list()
downregulated_genesets <- list()

# Iterate over the filtered results list
for (condition_name in names(ora_filtered_results_list)) {
  # Extract the data frame for the condition
  condition_df <- ora_filtered_results_list[[condition_name]]
  
  # Filter for upregulated genes (FDR < 0.05 and logFC > 0)
  upregulated_genes <- condition_df %>% 
    filter(FDR < 0.05, logFC > 0) %>% 
    rownames()
  
  # Filter for downregulated genes (FDR < 0.05 and logFC < 0)
  downregulated_genes <- condition_df %>% 
    filter(FDR < 0.05, logFC < 0) %>% 
    rownames()
  
  # Add to the lists
  upregulated_genesets[[condition_name]] <- upregulated_genes
  downregulated_genesets[[condition_name]] <- downregulated_genes
}

First, we’ll look at the temporal comparisons.

# Initialize an empty list to hold the GO analysis for each comparison
days_analysis_list <- vector("list", 9)

# Iterate over each comparison
for (i in 1:9) {
  cond <- names(ora_filtered_results_list)[i]
  # Perform g:Profiler analysis for upregulated and downregulated genes separately
  go_analysis <- gost(
    query = list(Downregulated = downregulated_genesets[[cond]],
                 Upregulated = upregulated_genesets[[cond]]),
    organism = "hsapiens",
    sources = c("GO:BP")
  )
  
  # Add the results to the list
  days_analysis_list[[cond]] <- go_analysis
}

# Remove the first 9 empty entries
days_analysis_list <- days_analysis_list[-(1:9)]

And now the patient ones.

# Initialize an empty list to hold the GO analysis for each comparison
patient_analysis_list <- vector("list", 4)

# Iterate over each comparison
for (i in 10:13) {
  cond <- names(ora_filtered_results_list)[i]
  # Perform g:Profiler analysis for upregulated and downregulated genes separately
  go_analysis <- gost(
    query = list(Downregulated = downregulated_genesets[[cond]],
                 Upregulated = upregulated_genesets[[cond]]),
    organism = "hsapiens",
    sources = c("GO:BP")
  )
  
  # Add the results to the list
  patient_analysis_list[[cond]] <- go_analysis
}

# Remove the first 4 empty entries
patient_analysis_list <- patient_analysis_list[-(1:4)]

We will now display both.

# Initialize an empty data frame to hold the top pathways for each condition and direction
top_pathways_df <- data.frame(matrix(ncol = length(names(days_analysis_list)) * 2, nrow = 10))

# Set the column names for the data frame
colnames(top_pathways_df) <- paste(rep(names(days_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(days_analysis_list))), sep = "_")

# Fill the data frame with the top pathways
for (cond in names(days_analysis_list)) {
  # Extract the top 10 upregulated pathways
  top_upreg <- days_analysis_list[[cond]]$result %>%
    filter(query == "Upregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_upreg <- top_upreg[1:10]
  if (length(top_upreg) < 10) {
    top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
  }

  # Extract the top 10 downregulated pathways
  top_downreg <- days_analysis_list[[cond]]$result %>%
    filter(query == "Downregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_downreg <- top_downreg[1:10]
  if (length(top_downreg) < 10) {
    top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
  }

  # Add the pathways to the data frame
  top_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
  top_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}

# Create the kable
kable(top_pathways_df, caption = "Table 10: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
Table 10: Top 10 upregulated and downregulated pathways for each temporal comparison. Intepretation of these results are in the Interpretation section.
results_twoVS0_Upregulated results_twoVS0_Downregulated results_fourVS2_Upregulated results_fourVS2_Downregulated results_sixVS4_Upregulated results_sixVS4_Downregulated results_tenVS6_Upregulated results_tenVS6_Downregulated results_fifteenVS10_Upregulated results_fifteenVS10_Downregulated results_twentyVS15_Upregulated results_twentyVS15_Downregulated results_fortyVS20_Upregulated results_fortyVS20_Downregulated results_sixtyVS40_Upregulated results_sixtyVS40_Downregulated results_sixtyVS0_Upregulated results_sixtyVS0_Downregulated
regulation of anatomical structure morphogenesis neuron projection morphogenesis heart development ribosome biogenesis heart development DNA replication muscle system process chromosome organization muscle system process cellular response to growth factor stimulus actin cytoskeleton organization cytoplasmic translation external encapsulating structure organization mitotic cell cycle NA fructose metabolic process muscle structure development ribosome biogenesis
embryonic morphogenesis plasma membrane bounded cell projection morphogenesis muscle structure development ncRNA metabolic process muscle structure development DNA-templated DNA replication muscle contraction mitotic cell cycle striated muscle contraction response to growth factor actin filament-based process peptide metabolic process extracellular matrix organization mitotic cell cycle process NA carbohydrate phosphorylation heart development ncRNA metabolic process
tissue morphogenesis cell projection morphogenesis tissue morphogenesis rRNA metabolic process muscle system process DNA repair striated muscle contraction mitotic cell cycle process actin filament-based movement regulation of cell migration supramolecular fiber organization translation extracellular structure organization chromosome organization NA fructose 2,6-bisphosphate metabolic process tube morphogenesis ncRNA processing
heart development cellular anatomical entity morphogenesis heart morphogenesis rRNA processing muscle tissue development DNA damage response oxoacid metabolic process chromosome segregation cardiac muscle contraction NA cell junction organization peptide biosynthetic process cellular response to growth factor stimulus chromosome segregation NA monosaccharide metabolic process vasculature development translation
head development cell morphogenesis mesenchyme development ncRNA processing striated muscle tissue development chromatin organization organic acid metabolic process nuclear chromosome segregation regulation of muscle system process NA apoptotic signaling pathway amide biosynthetic process circulatory system process mitotic nuclear division NA NA blood vessel development rRNA metabolic process
gastrulation cell morphogenesis involved in neuron differentiation enzyme-linked receptor protein signaling pathway ribosomal small subunit biogenesis cardiac muscle tissue development protein-DNA complex organization carboxylic acid metabolic process cell division actin-mediated cell contraction NA enzyme-linked receptor protein signaling pathway transmembrane receptor protein serine/threonine kinase signaling pathway response to growth factor organelle fission NA NA muscle tissue development amide biosynthetic process
morphogenesis of an epithelium chemical synaptic transmission tube morphogenesis ribosomal large subunit biogenesis muscle contraction chromosome organization generation of precursor metabolites and energy nuclear division muscle contraction NA cell morphogenesis transforming growth factor beta receptor superfamily signaling pathway regulation of system process nuclear division NA NA muscle cell differentiation peptide biosynthetic process
tube morphogenesis anterograde trans-synaptic signaling mesenchymal cell differentiation maturation of SSU-rRNA actin filament-based process double-strand break repair monocarboxylic acid metabolic process sister chromatid segregation cardiac muscle cell contraction NA ameboidal-type cell migration enzyme-linked receptor protein signaling pathway enzyme-linked receptor protein signaling pathway mitotic sister chromatid segregation NA NA striated muscle tissue development rRNA processing
embryo development ending in birth or egg hatching trans-synaptic signaling embryonic morphogenesis peptide metabolic process muscle cell differentiation chromatin remodeling heart process regulation of cell cycle process regulation of system process NA cell adhesion mediated by integrin protein-lipid complex remodeling camera-type eye development sister chromatid segregation NA NA cardiac muscle tissue development DNA damage response
chordate embryonic development regulation of cell projection organization cellular anatomical entity morphogenesis peptide biosynthetic process striated muscle cell differentiation negative regulation of transcription by RNA polymerase II cardiac muscle contraction DNA replication regulation of muscle adaptation NA response to wounding plasma lipoprotein particle remodeling sensory system development nuclear chromosome segregation NA NA blood vessel morphogenesis chromosome organization
# Initialize an empty data frame to hold the top pathways for each condition and direction
top_patient_pathways_df <- data.frame(matrix(ncol = length(names(patient_analysis_list)) * 2, nrow = 10))

# Set the column names for the data frame
colnames(top_patient_pathways_df) <- paste(rep(names(patient_analysis_list), each = 2), rep(c("Upregulated", "Downregulated"), times = length(names(patient_analysis_list))), sep = "_")

# Fill the data frame with the top pathways
for (cond in names(patient_analysis_list)) {
  # Extract the top 10 upregulated pathways
  top_upreg <- patient_analysis_list[[cond]]$result %>%
    filter(query == "Upregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_upreg <- top_upreg[1:10]
  if (length(top_upreg) < 10) {
    top_upreg <- c(top_upreg, rep(NA, 10 - length(top_upreg)))
  }

  # Extract the top 10 downregulated pathways
  top_downreg <- patient_analysis_list[[cond]]$result %>%
    filter(query == "Downregulated", term_size <= 1000) %>%
    arrange(p_value) %>%
    slice(1:10) %>%
    .$term_name

  # Ensure that we have exactly 10 entries by padding with NA if necessary
  top_downreg <- top_downreg[1:10]
  if (length(top_downreg) < 10) {
    top_downreg <- c(top_downreg, rep(NA, 10 - length(top_downreg)))
  }

  # Add the pathways to the data frame
  top_patient_pathways_df[paste(cond, "Upregulated", sep = "_")] <- top_upreg
  top_patient_pathways_df[paste(cond, "Downregulated", sep = "_")] <- top_downreg
}

# Create the kable
kable(top_patient_pathways_df, caption = "Table 11: Top 10 upregulated and downregulated pathways for each patient comparison. Intepretation of these results are in the Interpretation section.", align = 'c')
Table 11: Top 10 upregulated and downregulated pathways for each patient comparison. Intepretation of these results are in the Interpretation section.
results_lvVS60_Upregulated results_lvVS60_Downregulated results_rvVS60_Upregulated results_rvVS60_Downregulated results_laVS60_Upregulated results_laVS60_Downregulated results_raVS60_Upregulated results_raVS60_Downregulated
cellular respiration mitotic cell cycle cellular respiration mitotic cell cycle cellular respiration mitotic cell cycle process oxidative phosphorylation mitotic cell cycle process
aerobic respiration DNA damage response aerobic respiration mitotic cell cycle process aerobic respiration mitotic cell cycle cellular respiration mitotic cell cycle
energy derivation by oxidation of organic compounds mitotic cell cycle process energy derivation by oxidation of organic compounds DNA damage response oxidative phosphorylation cell division aerobic respiration cell division
generation of precursor metabolites and energy cell division generation of precursor metabolites and energy cell division mitochondrial ATP synthesis coupled electron transport DNA damage response respiratory electron transport chain DNA damage response
oxidative phosphorylation DNA repair oxidative phosphorylation DNA repair ATP synthesis coupled electron transport protein-DNA complex organization mitochondrial ATP synthesis coupled electron transport protein-DNA complex organization
respiratory electron transport chain regulation of cell cycle process respiratory electron transport chain regulation of cell cycle process aerobic electron transport chain chromosome organization ATP synthesis coupled electron transport chromosome organization
aerobic electron transport chain cell cycle phase transition ATP synthesis coupled electron transport cell cycle phase transition electron transport chain regulation of cell cycle process electron transport chain regulation of cell cycle process
ATP synthesis coupled electron transport chromosome organization mitochondrial ATP synthesis coupled electron transport chromosome organization respiratory electron transport chain cell cycle phase transition aerobic electron transport chain DNA repair
mitochondrial ATP synthesis coupled electron transport protein-DNA complex organization aerobic electron transport chain protein-DNA complex organization energy derivation by oxidation of organic compounds chromatin organization energy derivation by oxidation of organic compounds chromatin organization
electron transport chain protein modification by small protein conjugation electron transport chain regulation of cellular component biogenesis generation of precursor metabolites and energy DNA repair generation of precursor metabolites and energy cell cycle phase transition

Interpretation

Do the over-representation results support conclusions or mechanism discussed in the original paper?

It’s quite complicated to say whether or not our data supports the conclusions made by the original paper. The reason for this is because the authors were using other datasets to compare differentiation protocols to see whether or not using a Wnt inhibitor while directing differentiation into cardiomyocytes would cause their hPSCs to be more akin to adult LV cardiomyocytes. Our data and analysis is restricted to just this dataset, so all we can comment on is what we found. With this in mind, we can begin to answer this question. Our pairwise day comparisons we indeed see many of the same terms being upregulated in Figure 2C of our article. These include terms like muscle tissue development, actin-mediated cell contraction, and more. Additionally, we also get some similar terms for down-regulation, including ncRNA metabolic process and ribosome biogenesis. Together, we see that our analysis concurs with the research group’s, such that the protocol indeed up-regulated pathways related to creating cardiomyocytes and downregulated ‘general’ and ‘stem-my’ pathways as it differentiated. The other main analysis that we did was compare the most mature hPSC, our day60 one, to patient samples for the LV, RV, LA, and RA. Previously, we saw that there were no pathways that were up/down-regulated for LV and RV suggesting that they are quite similar to each other in terms of transciptome. Across all conditions, we see that pathways related to cellular respiration, oxidative phosphorylation, the mitotic cell cycle, DNA repair and regulation, and chromatin organization are upregulated in the patient samples compared to the hPSC-derived CM. This is likely because these samples came from cells that were from cadavers, and as a result, were cells that were involved in heart function and likely had higher metabolic demands in order to function in the heart.

Can you find evidence, i.e. publications, to support some of the results that you see? How does this evidence support your results?

The article’s main goal was to show that a methodology could be derived to improve the differentiation of hPSCs into cardiomyocytes more effectively, and their choice was to using CHIR to inhibit Wnt. When the authors compared their dataset to Cyganek et al.’s, they actually found that the other group’s day 90 compared favourably to our group’s day 60, indicating that our group’s differentiation protocol was good at making mature cardiomyocytes more quickly. Now, for our analysis, it would have been nice for us to have been able to compare our analysis to the other group’s day 90 to see if we had similar ORA results. Additionally, in a study looking at NOTCH1 knockouts by Ye et al., they also conducted a overrepresentation analysis, and in their day 10 cardiomyocytes, they had a similar result where pathways upregulating heart muscle development were favoured, and pathways involved in the cell cycle were downregulated. Together, this increases the confidence that our results show a robust differentiation protocol of hPSCs in cardiomyocytes.

Answers to Questions

References

Dark, Nicola, Marie-Victoire Cosson, Lorenza I Tsansizi, Thomas J Owen, Elisa Ferraro, Alice J Francis, Selina Tsai, et al. 2023. “Generation of Left Ventricle-Like Cardiomyocytes with Improved Structural, Functional, and Metabolic Maturity from Human Pluripotent Stem Cells.” Cell Rep Methods 3 (4): 100456–56. https://doi.org/https://doi.org/10.1016/j.crmeth.2023.100456.
Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Gu, Zuguang. 2022. “Complex Heatmap Visualization.” Imeta 1 (3).
Moutsopoulos, Ilias, Lukas Maischak, Elze Lauzikaite, Sergio A Vasquez Urbina, Eleanor C Williams, Hajk-Georg Drost, and Irina I Mohorianu. 2021. “noisyR: Enhancing Biological Signal in Sequencing Datasets by Characterizing Random Technical Noise.” Nucleic Acids Res. 49 (14): e83.
Reimand, Jüri, Meelis Kull, Hedi Peterson, Jaanus Hansen, and Jaak Vilo. 2007. “G:profiler–a Web-Based Toolset for Functional Profiling of Gene Lists from Large-Scale Experiments.” Nucleic Acids Res. 35 (Web Server issue): W193–200.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies.” Nucleic Acids Res. 43 (7): e47.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Tweedie, Susan, Bryony Braschi, Kristian Gray, Tamsin E M Jones, Ruth L Seal, Bethan Yates, and Elspeth A Bruford. 2020. “Genenames.org: The HGNC and VGNC Resources in 2021.” Nucleic Acids Research 49 (D1): D939–46. https://doi.org/https://doi.org/10.1093/nar/gkaa980.